
feat: idoctags serialization and deserialization matching the iso proposal #457

Merged
PeterStaar-IBM merged 29 commits into main from dev/align-idoctags-to-iso
Dec 17, 2025


@PeterStaar-IBM PeterStaar-IBM commented Dec 11, 2025

IDocTags Serialization Implementation

Overview

Implements bidirectional serialization between DoclingDocument and the IDocTags format, a specialized XML-based markup language for structured document representation with geometric and semantic annotations.

Serialization Features

Core Capabilities:

  • Two output modes: HUMAN_FRIENDLY (indented, readable) and LLM_FRIENDLY (compact, tokenizer-optimized)
  • Geometric encoding: Bounding boxes quantized to configurable resolution (default: 512×512) via tokens
  • Semantic markup: Rich vocabulary covering titles, headings, text, captions, lists, forms, tables, pictures, code, formulas
  • OTSL table structure: Optimized Table Sequence Language with cell types (fcel/ecel, ched/rhed/corn, lcel/ucel/xcel, nl)
  • Content control: add_content parameter allows structure-only serialization (omits text while preserving tags)
  • XML compliance mode: Optional escaping of special characters (&, <, >) for valid XML output
  • Deserialization: Round-trip support to reconstruct DoclingDocument from IDocTags
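The geometric encoding above can be illustrated with a minimal sketch. It assumes page coordinates with a top-left origin and a simple floor-and-clamp quantization; the function name `quantize_bbox` and its signature are hypothetical and are not taken from docling-core, whose exact rounding rules are not described in this PR.

```python
def quantize_bbox(left, top, right, bottom, page_width, page_height, resolution=512):
    """Map a bounding box in page units onto an integer grid [0, resolution - 1].

    Hypothetical helper: the real serializer may round or clamp differently.
    """

    def quantize(value, extent):
        # Scale to the grid, truncate to an integer, and clamp to the valid range.
        index = int(value / extent * resolution)
        return max(0, min(resolution - 1, index))

    return (
        quantize(left, page_width),
        quantize(top, page_height),
        quantize(right, page_width),
        quantize(bottom, page_height),
    )


# A box covering the lower-right quadrant of a 612 x 792 page.
quadrant = quantize_bbox(306, 396, 612, 792, page_width=612, page_height=792)
print(quadrant)
```

With the default resolution, every coordinate falls in the range 0 to 511, so each box can be expressed with four tokens drawn from a fixed vocabulary.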

Current Test Coverage

  • Vocabulary helpers (create_closing_token validation, self-closing token handling)
  • Content suppression correctness (captions, table cells, list items)
  • Multi-mode serialization (with/without content, human/LLM-friendly)
  • Metadata serialization and XML escaping
  • Round-trip edge cases
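As an illustration of what an XML-escaping round-trip check might look like, here is a small sketch that uses only the Python standard library. It does not call the IDocTags serializer; the `<text>` wrapper element and the helper names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape


def escape_content(text: str) -> str:
    # Escapes &, < and > so the text can sit inside an XML element.
    return escape(text)


def unescape_content(text: str) -> str:
    # Reverses escape_content.
    return unescape(text)


raw = "if a < b && b > c"
escaped = escape_content(raw)

# The escaped text parses as valid XML and decodes back to the original string.
parsed = ET.fromstring(f"<text>{escaped}</text>").text

print(escaped)
print(parsed == raw)
```

Without escaping, the same string would break an XML parser, which is the situation the optional XML compliance mode is meant to avoid.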

Outstanding Work (marked with FIXMEs)

  1. Inline groups with multi-provenance (idoctags.py:1093): the current split-per-provenance logic may need enhancement for inline groups.
  2. Checkbox token generation (lines 1174, 1179): add a dedicated create_selected_token() method to the vocabulary.
  3. Catch-all label handling (lines 1183-1184): refine the mapping for EMPTY_VALUE, HANDWRITTEN_TEXT, PARAGRAPH, etc.; EMPTY_VALUE may need a FormItem representation.
  4. OTSL cell logic verification (lines 1564, 1570): validate the rowstart/colstart conditions for UCEL/LCEL tokens.
  5. TABLE vs DOCUMENT_INDEX distinction (line 1602): check the label to emit the correct floating group type.
  6. FormItems are not yet handled.
  7. The <thread id="int"> element and page breaks are not yet handled. This will likely require changes to BaseSerializer, so it is left out of this PR.
  8. A few tests in ./test/test_deserializer_idoctags.py do not pass yet. They are currently skipped and still need to be fixed.
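To make item 4 more concrete, the sketch below shows one way the span-related OTSL tokens could be assigned on a grid. It is an illustration under stated assumptions, not the logic in idoctags.py: the function name, the cell tuple layout, and the omission of the header tokens (ched/rhed/corn) are all choices made for this example.

```python
def otsl_tokens(n_rows, n_cols, cells):
    """Return a flat OTSL token list for a table grid.

    cells: list of (row, col, row_span, col_span, text) tuples (hypothetical layout).
    """
    # Positions not covered by any cell default to an empty cell.
    grid = [["ecel"] * n_cols for _ in range(n_rows)]

    for row, col, row_span, col_span, text in cells:
        for r in range(row, row + row_span):
            for c in range(col, col + col_span):
                if r == row and c == col:
                    # Origin of the cell: full if it has text, empty otherwise.
                    grid[r][c] = "fcel" if text else "ecel"
                elif r == row:
                    # Same row as the origin: continues the cell to its left.
                    grid[r][c] = "lcel"
                elif c == col:
                    # Same column as the origin: continues the cell above.
                    grid[r][c] = "ucel"
                else:
                    # Interior of a span covering several rows and columns.
                    grid[r][c] = "xcel"

    tokens = []
    for row_tokens in grid:
        tokens.extend(row_tokens)
        tokens.append("nl")  # end of table row
    return tokens


# 2 x 3 table: "A" spans two columns, "B" spans two rows, position (1, 1) is empty.
example = otsl_tokens(2, 3, [(0, 0, 1, 2, "A"), (0, 2, 2, 1, "B"), (1, 0, 1, 1, "C")])
print(" ".join(example))
```

The row-start and column-start comparisons (`r == row`, `c == col`) are the kind of condition item 4 asks to verify, since swapping them would exchange lcel and ucel.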

Testing

Dump Mode Usage

Serialize DoclingDocuments from HuggingFace datasets to IDocTags format and generate a validation report:

python examples/convert_to_idoctags.py --mode dump [--config CONFIG.json] [--limit N]

What it does:

  • Loads documents from a HuggingFace dataset (default: docling-project/doclaynet-set-a)
  • Serializes them to IDocTags in multiple variants (LLM/human-friendly, with/without content, XML-compliant)
  • Generates an Excel report (./scratch/idoctags_report.xlsx) tracking success/failure for each serialization mode
  • Optionally limits processing to first N documents with --limit

Config file:

If --config is omitted, a default config (idoctags_dump_config.json) is auto-generated. Key settings: dataset_name, dataset_subset, output_dir, report_path, limit.

Use --write-default-config to generate the config template without running the dump.
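For orientation, a config file with the key settings listed above might be produced and read back as in the following sketch. Only the key names, the default dataset, and the report path come from this description; the values for dataset_subset, output_dir, and limit, as well as the helper names, are assumptions and may differ from what convert_to_idoctags.py actually writes.

```python
import json
import tempfile
from pathlib import Path

DEFAULT_CONFIG = {
    "dataset_name": "docling-project/doclaynet-set-a",  # default named in this PR
    "dataset_subset": None,  # assumed placeholder
    "output_dir": "./scratch",  # assumed from the report location
    "report_path": "./scratch/idoctags_report.xlsx",  # path named in this PR
    "limit": None,  # assumed: None means "process all documents"
}


def write_default_config(path: Path) -> None:
    # Hypothetical helper: write the template as pretty-printed JSON.
    path.write_text(json.dumps(DEFAULT_CONFIG, indent=2), encoding="utf-8")


def load_config(path: Path) -> dict:
    # Hypothetical helper: values in the file override the defaults.
    config = dict(DEFAULT_CONFIG)
    config.update(json.loads(path.read_text(encoding="utf-8")))
    return config


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "idoctags_dump_config.json"
    write_default_config(config_path)
    config = load_config(config_path)

print(config["report_path"])
```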

Running

uv run python ./examples/convert_to_idoctags.py --mode dump --config ./idoctags_dump_config.json

produces:

Wrote report (Excel via pandas) to: scratch/idoctags_report.xlsx
Overview summary:
 - Total processed: 3544
 - Loaded DoclingDocument: 3544
 - Serialized IDocTags (human_friendly, xml_compliant=True, content=True): 3529
 - Serialized IDocTags (human_friendly, xml_compliant=True, content=False): 3544
 - Serialized IDocTags (human_friendly, xml_compliant=False, content=True): 3034
 - Serialized IDocTags (human_friendly, xml_compliant=False, content=False): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=True, content=True): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=True, content=False): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=False, content=True): 3544
 - Serialized IDocTags (llm_friendly, xml_compliant=False, content=False): 3544
 - Serialized HTML: 3541
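The summary can be turned into shortfall counts with a few lines of arithmetic. The numbers below are copied from the log above; reading the difference from the total as "failed documents" is an assumption about how the report counts successes.

```python
total = 3544

# Modes from the log whose count is below the total.
serialized = {
    "human_friendly, xml_compliant=True, content=True": 3529,
    "human_friendly, xml_compliant=False, content=True": 3034,
    "HTML": 3541,
}

# Documents not serialized in each mode, and the success rate in percent.
shortfall = {mode: total - ok for mode, ok in serialized.items()}
success_rate = {mode: round(100 * ok / total, 2) for mode, ok in serialized.items()}

for mode in serialized:
    print(f"{mode}: {shortfall[mode]} missing, {success_rate[mode]}% succeeded")
```

All shortfalls in the IDocTags modes occur in the human_friendly mode with content enabled; every llm_friendly and content=False variant reaches the full 3544.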

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

mergify bot commented Dec 11, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
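The title rule can be checked locally with the same pattern. This sketch assumes the rule behaves like a Python regular-expression match; the function name `is_conventional` is made up for the example.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    # True when the title starts with a conventional-commit type, optional scope, and colon.
    return CONVENTIONAL_TITLE.match(title) is not None


titles = [
    "feat: idoctags serialization and deserialization matching the iso proposal",
    "feat(serializer)!: change default mode",
    "Align idoctags to iso",
]
results = [is_conventional(title) for title in titles]
print(results)
```

The pattern only constrains the prefix, so a title with a typo after the colon still passes.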

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2


github-actions bot commented Dec 11, 2025

DCO Check Passed

Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉


codecov bot commented Dec 11, 2025

Codecov Report

❌ Patch coverage is 60.00000% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling_core/transforms/serializer/common.py 55.55% 8 Missing ⚠️

@PeterStaar-IBM PeterStaar-IBM marked this pull request as ready for review December 12, 2025 16:25

dosubot bot commented Dec 12, 2025

Related Documentation

Checked 8 published document(s) in 1 knowledge base(s). No updates required.


…cTags

@PeterStaar-IBM PeterStaar-IBM changed the title from "feat: aling idoctags to iso" to "feat: align idoctags to iso" Dec 15, 2025
@dolfim-ibm dolfim-ibm changed the title from "feat: align idoctags to iso" to "feat: idoctags serialization and deserialization matching the iso proposal" Dec 16, 2025

Let's remove this comment since the dataset is not public.

vagenas
vagenas previously approved these changes Dec 17, 2025
@PeterStaar-IBM PeterStaar-IBM merged commit dda9c88 into main Dec 17, 2025
12 of 13 checks passed
@PeterStaar-IBM PeterStaar-IBM deleted the dev/align-idoctags-to-iso branch December 17, 2025 13:38


4 participants
